{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 304,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(sec:fake_news_data)=\n",
    "# Obtaining and Wrangling the Data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the data into Python using the [GitHub page for FakeNewsNet](https://github.com/KaiDMML/FakeNewsNet/tree/654361e1c8d5baa751baf1dac5032df621652280). Reading over the repository description and code, we find that the repository doesn't actually store the news articles itself. Instead, running the repository code will scrape news articles from online web pages directly (using techniques we covered in {numref}`Chapter %s <ch:web>`). This presents a challenge: if an article is no longer available online, it likely will be missing from our dataset. Noting this, let's proceed with downloading the data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "The FakeNewsNet code highlights one challenge in reproducible research---online datasets change over time, but it can be difficult (or even illegal) to store and share copies of this data. For example, other parts of the FakeNewsNet dataset use Twitter posts, but the dataset creators would violate Twitter's terms and services if they stored copies of the posts in their repository. When working with data gathered from the web, we suggest documenting the date the data were gathered and reading the terms and services of the data sources carefully.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running the script to download the Politifact data takes about an hour.\n",
    "After that, we place the datafiles into the _data/politifact_ folder.\n",
    "The articles that Politifact labeled as fake and real\n",
    "are in _data/politifact/fake_ and _data/politifact/real_.\n",
    "Let's take a look at one of the articles labeled \"real\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 305,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 0\n",
      "drwxr-xr-x  2 sam  staff  64 Jul 14  2022 \u001b[34mpolitifact100\u001b[m\u001b[m\n",
      "drwxr-xr-x  3 sam  staff  96 Jul 14  2022 \u001b[34mpolitifact1013\u001b[m\u001b[m\n",
      "drwxr-xr-x  3 sam  staff  96 Jul 14  2022 \u001b[34mpolitifact1014\u001b[m\u001b[m\n",
      "drwxr-xr-x  2 sam  staff  64 Jul 14  2022 \u001b[34mpolitifact10185\u001b[m\u001b[m\n",
      "ls: stdout: Undefined error: 0\n"
     ]
    }
   ],
   "source": [
    "!ls -l data/politifact/real | head -n 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 306,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 16\n",
      "-rw-r--r--  1 sam  staff   5.7K Jul 14  2022 news content.json\n"
     ]
    }
   ],
   "source": [
    "!ls -lh data/politifact/real/politifact1013/"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each article's data is stored in a JSON file named _news content.json_.\n",
    "Let's load the JSON for one article into a Python dictionary (see {numref}`Chapter %s <ch:web>`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 307,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "article_path = Path('data/politifact/real/politifact1013/news content.json')\n",
    "article_json = json.loads(article_path.read_text())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we've displayed the keys and values in `article_json` as a table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 308,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>key</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>url</th>\n",
       "      <td>http://www.senate.gov/legislative/LIS/roll_cal...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>text</th>\n",
       "      <td>Roll Call Vote 111th Congress - 1st Session\\n\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>images</th>\n",
       "      <td>[http://statse.webtrendslive.com/dcs222dj3ow9j...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top_img</th>\n",
       "      <td>http://www.senate.gov/resources/images/us_sen.ico</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>keywords</th>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>authors</th>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>canonical_link</th>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>title</th>\n",
       "      <td>U.S. Senate: U.S. Senate Roll Call Votes 111th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>meta_data</th>\n",
       "      <td>{'viewport': 'width=device-width, initial-scal...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>movies</th>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>publish_date</th>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>source</th>\n",
       "      <td>http://www.senate.gov</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>summary</th>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            value\n",
       "key                                                              \n",
       "url             http://www.senate.gov/legislative/LIS/roll_cal...\n",
       "text            Roll Call Vote 111th Congress - 1st Session\\n\\...\n",
       "images          [http://statse.webtrendslive.com/dcs222dj3ow9j...\n",
       "top_img         http://www.senate.gov/resources/images/us_sen.ico\n",
       "keywords                                                       []\n",
       "authors                                                        []\n",
       "canonical_link                                                   \n",
       "title           U.S. Senate: U.S. Senate Roll Call Votes 111th...\n",
       "meta_data       {'viewport': 'width=device-width, initial-scal...\n",
       "movies                                                         []\n",
       "publish_date                                                 None\n",
       "source                                      http://www.senate.gov\n",
       "summary                                                          "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_df(\n",
    "    pd.DataFrame(article_json.items(), columns=['key', 'value']).set_index('key'),\n",
    "    rows=13)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many fields in the JSON file, but for this analysis we only look at a\n",
    "few that are primarily related to the content of the article: the article's title, text content, URL, and publication date.\n",
    "We create a data frame where each row represents one article (the granularity in a news story).\n",
    "To do this, we load in each available JSON file as a Python dictionary, and then\n",
    "extract the fields of interest to store as a `pandas` `DataFrame` named `df_raw`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 309,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "def df_row(content_json):\n",
    "    return {\n",
    "        'url': content_json['url'],\n",
    "        'text': content_json['text'],\n",
    "        'title': content_json['title'],\n",
    "        'publish_date': content_json['publish_date'],\n",
    "    }\n",
    "\n",
    "def load_json(folder, label):\n",
    "    filepath = folder / 'news content.json'\n",
    "    data = df_row(json.loads(filepath.read_text())) if filepath.exists() else {}\n",
    "    return {\n",
    "        **data,\n",
    "        'label': label,\n",
    "    }\n",
    "\n",
    "fakes = Path('data/politifact/fake')\n",
    "reals = Path('data/politifact/real')\n",
    "\n",
    "df_raw = pd.DataFrame([load_json(path, 'fake') for path in fakes.iterdir()] +\n",
    "                      [load_json(path, 'real') for path in reals.iterdir()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 310,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>publish_date</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>dailybuzzlive.com/cannibals-arrested-florida/</td>\n",
       "      <td>Police in Vernal Heights, Florida, arrested 3-...</td>\n",
       "      <td>Cannibals Arrested in Florida Claim Eating Hum...</td>\n",
       "      <td>1.62e+09</td>\n",
       "      <td>fake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>https://web.archive.org/web/20171228192703/htt...</td>\n",
       "      <td>WASHINGTON — Rod Jay Rosenstein, Deputy Attorn...</td>\n",
       "      <td>BREAKING: Trump fires Deputy Attorney General ...</td>\n",
       "      <td>1.45e+09</td>\n",
       "      <td>fake</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 url   \n",
       "0      dailybuzzlive.com/cannibals-arrested-florida/  \\\n",
       "1  https://web.archive.org/web/20171228192703/htt...   \n",
       "\n",
       "                                                text   \n",
       "0  Police in Vernal Heights, Florida, arrested 3-...  \\\n",
       "1  WASHINGTON — Rod Jay Rosenstein, Deputy Attorn...   \n",
       "\n",
       "                                               title  publish_date label  \n",
       "0  Cannibals Arrested in Florida Claim Eating Hum...      1.62e+09  fake  \n",
       "1  BREAKING: Trump fires Deputy Attorney General ...      1.45e+09  fake  "
      ]
     },
     "execution_count": 310,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_raw.head(2)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exploring this data frame reveals some issues we'd like to address before we begin the analysis. For example:\n",
    "\n",
    "* Some articles couldn't be downloaded. When this happened, the `url` column contains `NaN`.\n",
    "* Some articles don't have text (such as a web page with only video content). We drop these articles from our data frame.\n",
    "1.*Unix epoch), so we need to convert them to `pandas.Timestamp` objects.\n",
    "* We're interested in the base URL of a web page. However, the `source` field in the JSON file has many missing values compared to the `url` column. So we must extract the base URL using the full URL in the `url` column. For example, from `dailybuzzlive.com/cannibals-arrested-florida/` we get `dailybuzzlive.com`.\n",
    "* Some articles were downloaded from an archival website (`web.archive.org`). When this happens, we want to extract the actual base URL from the original by removing the `web.archive.org` prefix.\n",
    "* We want to concatenate the `title` and `text` columns into a single `content` column that contains all of the text content of the article.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can tackle these data issues using a combination of `pandas` functions and\n",
    "regular expressions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 314,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# [1], [2]\n",
    "def drop_nans(df):\n",
    "    return df[~(df['url'].isna() |\n",
    "                (df['text'].str.strip() == '') | \n",
    "                (df['title'].str.strip() == ''))]\n",
    "\n",
    "# [3]\n",
    "def parse_timestamps(df):\n",
    "    timestamp = pd.to_datetime(df['publish_date'], unit='s', errors='coerce')\n",
    "    return df.assign(timestamp=timestamp)\n",
    "\n",
    "# [4], [5]\n",
    "archive_prefix_re = re.compile(r'https://web.archive.org/web/\\d+/')\n",
    "site_prefix_re = re.compile(r'(https?://)?(www\\.)?')\n",
    "port_re = re.compile(r':\\d+')\n",
    "\n",
    "def url_basename(url):\n",
    "    if archive_prefix_re.match(url):\n",
    "        url = archive_prefix_re.sub('', url)\n",
    "    site = site_prefix_re.sub('', url).split('/')[0]\n",
    "    return port_re.sub('', site)\n",
    "\n",
    "# [6]\n",
    "def combine_content(df):\n",
    "    return df.assign(content=df['title'] + ' ' + df['text'])\n",
    "\n",
    "def subset_df(df):\n",
    "    return df[['timestamp', 'baseurl', 'content', 'label']]\n",
    "\n",
    "df = (df_raw\n",
    " .pipe(drop_nans)\n",
    " .reset_index(drop=True)\n",
    " .assign(baseurl=lambda df: df['url'].apply(url_basename))\n",
    " .pipe(parse_timestamps)\n",
    " .pipe(combine_content)\n",
    " .pipe(subset_df)\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After data wrangling, we end up with the following data frame named `df`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 315,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>baseurl</th>\n",
       "      <th>content</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2021-04-05 16:39:51</td>\n",
       "      <td>dailybuzzlive.com</td>\n",
       "      <td>Cannibals Arrested in Florida Claim Eating Hum...</td>\n",
       "      <td>fake</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-01-01 23:17:43</td>\n",
       "      <td>houstonchronicle-tv.com</td>\n",
       "      <td>BREAKING: Trump fires Deputy Attorney General ...</td>\n",
       "      <td>fake</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            timestamp                  baseurl   \n",
       "0 2021-04-05 16:39:51        dailybuzzlive.com  \\\n",
       "1 2016-01-01 23:17:43  houstonchronicle-tv.com   \n",
       "\n",
       "                                             content label  \n",
       "0  Cannibals Arrested in Florida Claim Eating Hum...  fake  \n",
       "1  BREAKING: Trump fires Deputy Attorney General ...  fake  "
      ]
     },
     "execution_count": 315,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 316,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# Hidden cell, stores data for following sections.\n",
    "df.to_csv('data/fake_news.csv', index=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've loaded and cleaned the data, we can proceed to exploratory\n",
    "data analysis."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
